Pattern Recognition Chapter 1. Notes

1.0 Introduction

  • Handwriting is a non-trivial problem due to variability.
  • Why is handcrafting or using heuristics bad?
    • such an approach leads to proliferation (rapid increase) of rules and exceptions
    • i.e. can we model all the possible transformation of a given number?
  • generalization: ability to categorize new examples that differ from training set
  • Input variables are pre-processed to new space of variables, so that the problem is easier to solve
    • I.e. MNIST digits are normalized and shifted towards centre (reducing positional variability)
    • pre-processing is sometimes called 'feature extraction'
  • Useful features are fast to compute, and preserve discriminatory information.
  • Feature extration is a form of dimensionality reduction.
  • Need to be careful of the balance of the power of the feature and computation power

  • 3 kinds of machine learning problems:

    • supervised: (classification) (regression)
    • un-supervised: (clustering) (density estimation) (dimensionality reduction)
    • reinforcement learning: problem of finding suitable actions to take in a given example in order to maximize a reward (exploration vs. exploitation)

1.1 Example: Polynomial Curve Fitting

Here we will try to recreate the example given in the textbook.

  • $x$ is generated as a set of numbers from $[0, 1]$ with $n$ divisions
  • target $t$ is generated from $sin(2\pi x)$ with a gaussian distribution as noise

  • what we want to do is try to use this $x$ to try and see if we can make predictions of the value $\hat{t}$ for some new input value $\hat{x}$

  • we are trying to discover the inner oracle function
  • Probabilitiy theory gives a framework for expressing such uncertainty in precise and quantitave manner
  • Decision theory allows us to use this probabilistic representation to make predictions that are optimal

Here we describe a polynomial function which is nonlinear function of $x$, but linear function of $w$. Linear parameters describe a set of functions that have important properties are called linear models.

$$ \begin{equation} y(x,W) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum^{M}_{j=0} w_j x^j \end{equation} $$

We need Error Functions to give an estimate of how close the values are compared to the given target values. Note that an error function is a function of the learned parameter, not input nor output. Here is an example of SSE (sum of squares of error). Here we will need to come up with parameter $w$ such that we minimize the error between the given input and the target output.

$$ E(w) = \frac{1}{2} \sum^{N}_{n=1}(y(x_n, W) - t_n)^2 $$

Note: Non-negative, Convex


In [1]:
%matplotlib inline
import matplotlib
import numpy as np
from numpy import random
import matplotlib.pyplot as plt

# Generate a example function sin(2*pi*x)
num_points = 10
x = np.linspace(0., 1., num_points)
t = np.sin(2.0 * np.pi * x) + np.random.normal(0, 0.15, num_points)

x_dense = np.linspace(0., 1., 100)
truth = np.sin(2.0 * np.pi * x_dense)

# Plot
plt.plot(x_dense, truth, 'g')
plt.plot(x, t, 'ro')
plt.title('Example 1.1');


  • We can solve for the optimal $w$ by finding the derivative of the error function. Since, the error function is linear in terms of w, $E(w)$, the output has a unique solution.

  • Note that there is a problem of picking M. This is a process called Model Selection, since we change the family of function used to try to learn the 'oracle' function.

Model Selection

  • note that M = 0,1 is a horizontal line, and linear line that has minimum error
  • however, it simply does not have the capability to represent the oracle function
  • when the M is high, it overfits, even if Error is zero!
  • when M is low, it underfits and we can see that Error is quite high.

RMS Error

$$ E_{RMs} = \sqrt{2E(\bar{w})/N} $$
  • for M = 10 case, the model has 10 degrees of freedom, thus can memorize the points. However, once test dataset is used, the error is high. (The high capacity model tries to measure the noise)
  • having more training data also helps interms of reducing the over-fitting problem
  • roughly the number of parameter given the data points shoud no less be some multiple of the parameter, but number of parameter isn't always the best indication of model complexity

Complexity of the Model

  • it is noted that the complexity of the model be picked, according to the complexity of the problem being solved
  • least squares approach represents maximum liklihood
  • bayesian approach can be used to avoid the overfitting, due to the fact that it can adaptively choose the number of parameters

Solving Practical Application. Need to find way to find a suitable value for the model complexity. Simple method to is to keep a training data with validation set to fine-tune the model complexity.

Regularization

  • method to control over-fitting phenomenon

i.e.

$$ \tilde{E}(\boldsymbol{w}) = \frac{1}{2}\sum^{N}_{n=1}\{y(x_n, \boldsymbol{w}) - t_n\}^2 + \frac{\lambda}{2}\lVert\boldsymbol{w}\rVert_2 $$
  • this prevents the model to learn large weights
  • regularization also controls the effective complexity of the model

1.2 Probability Theory

  • provides framework of quantifying and manipulation of uncertainty
  • combined with decision theory, we can make optimal prediction with uncertain data

  • Note: by definition, Probabilities need to lie in between [0, 1]

  • Sum of the probabilities of all mutually exclusive events must sum to 1
  • Mutual Exclusivity: where two events cannot both occur. (i.e. a coin toss giving both head and tail)
  • Joint Probability: The probability of 2 non-mutually exclusive event occuring. For example, let $X, Y$ describe real numbers divided by $M, N$ specific outcomes, respectively. We can descibe the probability of a certain event in x and y to occur as: $$ p(X = x_i, Y = y_i) $$
  • Sum Rule: Specifies that a maginal probability, $p(X=x_i)$, can be obtained by summing out the other variables. (in this case Y). In another words, the probability of one event can be obtained by summing each other event. $$ p(X=x_i) = \sum^{N}_{j=1}p(X=x_i, Y=y_j) $$
  • Conditional Probability: The probability of an event given information of a non-mutually exclusive event. $$ p(X=x_i | Y=y_i) $$
  • Product Rule: used to determine the probability of 2 independent events, by multiplying the other events by their marginal values $$ p(X=x_i, Y=y_i) = p(Y=y_i|X=x_i)p(X=x_i) $$

  • Symmetry Property $$ p(X,Y) = p(Y,X) $$

  • Bayes' Theorem: equation to calculate the posterior probability (probability given the observations). $$ p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)} $$

  • Prior: probability available before we make observations, $p(Y)$

  • Liklihood: conditional probability that gets affected from the prior information, $p(X|Y)$
  • Independence: if the given events are independent, one event does not affect the outcome probability of the other. I.e. if the chance/liklihood of an event X is the same for every possible outcome of Y, $p(X|Y) = p(X)$.

1.2.1 Probability Densities

  • Probability density is the area under a probability distribution $$ p(x \in (a,b)) = \int^{b}_{a}p(x)dx $$
  • Note: probability is non-negative, and sum to 1 $$ p(x) \geq 0, \int^{-\infty}_{\infty}p(x)dx = 1 $$
  • Change of variable
    • let $x = g(y)$
    • then $p_x(x) = p_x(g(y))$ and $p_y(y) = p_x\circ g(y)$
$$ p_y(y) = P_y'(y) = \frac{d}{dy}P_y(y) = \frac{d}{dy}P_x(g(y)) = p_x(g(y))\big|\frac{dg(y)}{dy}\big| = p_x(g(y))\big|\frac{dx}{dy}\big| $$
  • Note that the derivatives are absolute value, since cumulative ditribution functions cannot be negative.
  • Cumulative Distribution Function: probability that x lies in some interval $(-\infty, z)$ $$ P(z) = \int^{z}_{-\infty}p(x)dx $$

  • For multivariate probability, where $\boldsymbol{x} = (x_1, x_2, ... , x_n)$ and the probability turns into a joint probability with the vector, $p(\boldsymbol{x} = p(x_1, x_2, ... , x_n)$, the above rule for total probability and summation needs to hold.

$$ p(\boldsymbol{x}) \geq 0, \int p(\boldsymbol{x})d\boldsymbol{x} = 1 $$
  • For discrete case, it is called probability mass function

  • For probability distributions, it is also noted that they follow the product and summation rule

$$ p(x) = \int p(x,y)dy $$$$ p(x,y) = p(y|x)p(x) $$

1.2.2 Expectations and Covariances

  • weighted average of functions is very important operation in probabilities
  • average of a function $f(x)$ under a probability distribution $p(x)$ is called expectation $\mathbb{E}[f]$ given by:
$$ \mathbb{E}[f] = \sum_x p(x)f(x) $$
  • or in continuous domain
$$ \mathbb{E}[f] = \int p(x)f(x)dx $$
  • if we are given the number of sample points over the distribution, the expectation can be approximated as
$$ \mathbb{E}[f] = \frac{1}{N}\sum^{N}_{n=1}f(x_n) $$
  • if we are given a multivariate function, the expectation of the multivariate function with respect to a single variable would be:
$$ \mathbb{E}_x[f(x,y)] $$
  • This function would be a function of y at this point.
  • Conditional expectation has the form of:
$$ \mathbb{E_x}[f|y] = \sum_{x}p(x|y)f(x) $$
  • Variance is a measure of spread of a distribution
$$ var[f] = \mathbb{E}[(f(x) - \mathbb{E}[f(x))^2] $$
  • Few more nice formulation of variance:
$$ var[x] = \mathbb{E}(x^2) - \mathbb{E}(x)^2 $$$$ cov[x, y] = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}(xy) - \mathbb{E}(x)\mathbb{E}(y) $$
  • Note for 2 vectors of random variables, it is a matrix
$$ cov[\boldsymbol{x}, \boldsymbol{y}] = \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{y}^T - \mathbb{E}[\boldsymbol{y}^T])] = \mathbb{E}(\boldsymbol{x}\boldsymbol{y}^T) - \mathbb{E}(\boldsymbol{x})\mathbb{E}(\boldsymbol{y}^T)$$

1.2.3. Bayesian Probabilities

  • so far, we viewed probability as frequencies of random repeatable objects (classical, frequentist)
  • now, we take a look at Bayesian view which probabilities prove quantification of uncertainty
  • Bayesian interpretation of probability allows us to:
    • quantify our unccertainty and make precise updates to uncertainty with evidences
    • as well as take optimal actions or decision as consequence
  • bayesian vs. frequentist

    • frequentist sets the weight/(true value) fixed and and tries to estimate the best possible parameter by considering the fact that there is a distribution of possible dataset
    • bayesians say there is only one dataset and the uncertainty in the parameter is expressed through distribution over the model

    • Frequentist consider measure of the frequency of repeated events, thus, model parameters are fixed and data to be random

    • Bayesians consider the probability as a measure of degree of certainty about the values and thus model parameters to be random and the data fixed.

In [ ]: